Extracting Key Terms from Chinese and Japanese Texts

Author

  • Pascale Fung
Abstract

Key term extraction is very useful for information retrieval. Most term extraction methods use one of two approaches, namely lexical and grammatical. We argue that due to the differences in linguistic and character set characteristics of Chinese and Japanese, a lexical approach is more suitable for Chinese whereas a grammatical approach is more suitable for Japanese. In this paper, we present two simple yet powerful systems for Chinese and Japanese key term extraction: CXtract and JBrat. CXtract uses predominantly statistical lexical information to find term boundaries in large text. JBrat is based on morphosyntactic information of the Japanese character sets for terms. Evaluation results show that CXtract has an 80.24% average precision in term extraction, and JBrat has an 88.07% average precision.
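As a rough illustration of why character sets help with Japanese, the sketch below treats maximal runs of kanji and katakana as candidate terms, with hiragana and punctuation acting as boundaries. This is a simplified heuristic assumed here for illustration; it is not JBrat's actual rule set, and the function name, the Unicode ranges, and the minimum length are all hypothetical choices.

```python
import re

# Assumption for illustration only: a candidate term is a maximal run of
# kanji and/or katakana characters; hiragana and punctuation act as boundaries.
KANJI = r"\u4e00-\u9fff"            # CJK Unified Ideographs (basic block)
KATAKANA = r"\u30a1-\u30fa\u30fc"   # katakana letters plus the long-vowel mark
CANDIDATE = re.compile(f"[{KANJI}{KATAKANA}]+")


def candidate_terms(text, min_len=2):
    """Return kanji/katakana runs of at least min_len characters, in order."""
    return [m.group() for m in CANDIDATE.finditer(text) if len(m.group()) >= min_len]


sample = "この論文では情報検索のためのキーワード抽出について述べる。"
print(candidate_terms(sample))
```

Chinese text has no comparable script alternation to mark boundaries, which is consistent with the abstract's argument for a statistical lexical approach there.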

Related resources

A Corpus-based Machine Translation Method of Term Extraction in LSP Texts

To tackle the problems of term extraction in language specific field, this paper proposes a method of coordinating use of corpus and machine translation system in extracting terms in LSP text. A comparable corpus built for this research contains 167 English texts and 229 Chinese texts with around 600,000 English tokens and 900,000 Chinese characters. The corpus is annotated with mega-informatio...

Collecting Bilingual Technical Terms from Patent Families of Character-Segmented Chinese Sentences and Morpheme-Segmented Japanese Sentences

In manual translation of patent documents, a technical term bilingual lexicon is inevitable for a translator to efficiently translate patent documents. Dong et al. (2015) proposed a method of generating bilingual technical term lexicon from morpheme-segmented parallel patent sentences. The proposed method estimates Japanese-Chinese translation of technical terms using the phrase translation tab...

Enhancing Text Representation for Classification Tasks with Semantic Graph Structures

To represent the textual knowledge more expressively, a kind of semantic-based graph structure is proposed, in which more semantic and ordering information among terms as well as the structural information of the text are incorporated. Such a model can be constructed by extracting representative terms from texts and their mutual semantic relationships. Afterward, it is represented as a graph, wh...

Extracting Recurrent Phrases and Terms from Texts Using a Purely Statistical Method

Most statistical measures for extracting interesting word pairs such as MI and t-score require a large corpus to work well. This paper evaluates some of the most widely used statistical measures and introduces a method that can identify significant bigrams in relatively small texts by adapting Fung and Church's (1994) K-vec algorithm, which was originally designed to extract word correspondence...

A system for generating user's chronological interest space from web browsing history

We propose a method that helps users to understand their own interests by extracting terms from selected link texts and generating a new browsing history, and arranging those terms and Web-page icons on to the user's interest space in chronological order. We have implemented a prototype system based on this method. The system's performance was evaluated in two experiments, which revealed that (...

Publication date: 1998